[1] "Beautiful day"
[1] NA
[1] "Beautiful NA"
str_split(): pull apart raw string data into more useful variables
"10202"? "102a"? What about in this "1O2"? "2,32.1,0.4"!
grep - global regular expression print. Is there a pattern in a string?
grepl - returns a logical value.
. \ | ( ) [ { ^ $ * + ?
\ - escape character
. - any (just one) character
^ - beginning of a string
$ - end of a string
| - or sign
() - group
? - matches at most 1 time
* - matches at least 0 times
+ - matches at least 1 time
{m} - matches exactly m times
{m,n} - matches between m and n times
{m,} - matches at least m times
[1] "bell pepper" "blood orange" "canary melon"
[4] "chili pepper" "goji berry" "kiwi fruit"
[7] "purple mangosteen" "rock melon" "salal berry"
[10] "star fruit" "ugli fruit"
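The example strings from the regular-expression slide can be checked with a short base R sketch. The pattern `^[0-9]+$` is an assumption about what the slide is asking (is the whole string made of digits?), and `strsplit()` stands in here for `str_split()` so that the block needs no packages.

```r
# Strings from the slide: which ones are plain whole numbers?
x <- c("10202", "102a", "1O2", "2,32.1,0.4")

# grepl() returns one logical value per element; ^ and $ anchor the match
# to the whole string, and [0-9]+ requires one or more digits in between
only_digits <- grepl("^[0-9]+$", x)

# grep() returns the positions of the matching elements instead
digit_positions <- grep("^[0-9]+$", x)

# Pull the comma-separated string apart into numeric values
# (base R strsplit(); stringr::str_split() plays the same role)
parts <- as.numeric(strsplit(x[4], ",", fixed = TRUE)[[1]])

print(only_digits)      # TRUE FALSE FALSE FALSE
print(digit_positions)  # 1
print(parts)            # 2.0 32.1 0.4
```

Note that "1O2" fails the test because its middle character is the letter O, not a zero.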
1. Collection of text documents
2. Pre-processing of text
3. Text mining techniques
4. Analyze the text
5. Knowledge discovery
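library(dplyr)     # tibble() and %>%
library(tidytext)  # unnest_tokens()
text <- c("Great white shark just ate my leg.", "Not a wonderful day and days!")
text_df <- tibble(id = 1:2, text = text)
# One row per word; unnest_tokens() lowercases and strips punctuation by default
text_df %>%
  unnest_tokens(word, text)
# A tibble: 13 × 2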
id word
<int> <chr>
1 1 great
2 1 white
3 1 shark
4 1 just
5 1 ate
6 1 my
7 1 leg
8 2 not
9 2 a
10 2 wonderful
11 2 day
12 2 and
13 2 days
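To make the result above less of a black box, here is a minimal base R sketch that reproduces the same 13 tokens. It only approximates the steps (lowercase, drop punctuation, split on whitespace); it is not how tidytext tokenizes internally, and the names `clean`, `tokens` and `tokens_df` are made up for the illustration.

```r
text <- c("Great white shark just ate my leg.", "Not a wonderful day and days!")

# Lowercase, then remove punctuation characters
clean <- gsub("[[:punct:]]", "", tolower(text))

# Split each document on runs of whitespace: one character vector per document
tokens <- strsplit(clean, "[[:space:]]+")

# Flatten into one row per word, keeping the id of the source document
tokens_df <- data.frame(
  id = rep(seq_along(tokens), lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)

print(tokens_df)
```

For real text (apostrophes, hyphens, Unicode) a proper tokenizer such as the one behind unnest_tokens() is the safer choice.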
stop words are the most common words in any language (articles, prepositions, pronouns, conjunctions, etc.)
they add little information to the text
examples of stop words in English are “the”, “a”, “an”, “so”, “what”, ...
why remove stop words? Dropping this low-level information lets the analysis focus on the important information
Do we always remove stop words? NO!
Before removing stop words, research a bit about your task and the problem you are trying to solve, and then make your decision!
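A small sketch of why the decision matters, using the second example sentence from the tokenization slide. The stop list below is a hypothetical hand-written one for illustration, not the list shipped by any package.

```r
# Tokens of "Not a wonderful day and days!"
words <- c("not", "a", "wonderful", "day", "and", "days")

# Hypothetical stop list (assumption: a real list would be much longer)
stop_list <- c("the", "a", "an", "so", "what", "and", "not")

# Keep only the tokens that are not in the stop list
kept <- words[!(words %in% stop_list)]

print(kept)  # "wonderful" "day" "days"
```

With "not" removed, what is left reads as positive, which is the opposite of the original sentence. For a task such as sentiment analysis, negations are therefore usually worth keeping.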
many packages:
[[1]]
[1] "The hotel is ideally located and is in a beautiful building."
[2] "Most of the staff are very polite and helpful."
[3] "Rooms are comfortable and it has a serviceable gym."
[4] "Avoid going to breakfast before 0700 or wearing flip flops or slippers, you will be admonished and sent back to your room to change."
[[2]]
[1] "The hotel is a short walk to the pedestrian mall, restaurants and cafes."
[2] "The hotel is an old historical landmark."
[3] "I loved the tall ceilings, lobby and restaurant."
[4] "The bathroom has been updated and is very nice."
[5] "The breakfast buffet is very good with many options and you can eat outside."
[6] "We enjoyed our stay here."
attr(,"class")
[1] "get_sentences" "get_sentences_character"
[3] "list"
library(sentimentr)
# Score each review with two different polarity lexicons
dfJR <- sentiment_by(sentences, polarity_dt = lexicon::hash_sentiment_jockers_rinker)
dfSE <- sentiment_by(sentences, polarity_dt = lexicon::hash_sentiment_sentiword)
dfJR
element_id word_count sd ave_sentiment
1: 1 52 0.3861717 0.2993681
2: 2 56 0.2501910 0.2914671
dfSE
element_id word_count sd ave_sentiment
1: 1 52 0.1763900 0.14263246
2: 2 56 0.1420433 0.02832483
sentences <- "The great white shark just ate my leg!"
# Same two lexicons, now on a single sentence
dfJR <- sentiment_by(sentences, polarity_dt = lexicon::hash_sentiment_jockers_rinker)
dfSE <- sentiment_by(sentences, polarity_dt = lexicon::hash_sentiment_sentiword)
dfJR
element_id word_count sd ave_sentiment
1: 1 8 NA -0.03535534
dfSE
element_id word_count sd ave_sentiment
1: 1 8 NA -0.04787702
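The NA in the sd column is most likely not an error. Assuming sd is the standard deviation of the sentence-level scores within each element, an element with a single sentence has no spread to measure. The base R sketch below shows the same behaviour; the vector `several` holds made-up scores for illustration.

```r
# One sentence: a single score, as in the shark example
one_sentence <- c(-0.03535534)

# Several sentences: hypothetical scores, not taken from the hotel reviews
several <- c(0.5, 0.1, 0.3)

sd_single <- sd(one_sentence)  # NA: the sample sd needs at least two values
sd_several <- sd(several)

print(sd_single)
print(sd_several)
```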
library(lubridate)
dataNYC$pickup_datetime <- ymd_hms(dataNYC$pickup_datetime)
Sys.getlocale()
[1] "LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"
[1] "01-03-18"
[1] "01-Mrz-2018"
[1] "01-März-18"
Use Sys.getlocale() and Sys.setlocale() to:
In this exercise you will work with the date, “1930-08-30”, Warren Buffett’s birth date! Mind the locale language!
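One possible starting point for the exercise, as a sketch rather than the intended solution. It assumes the goal is to print month and weekday names in English regardless of the machine's language, and uses the portable "C" locale for that; the variable names are made up.

```r
birth <- as.Date("1930-08-30")

# Remember the current time locale so it can be restored afterwards
old_locale <- Sys.getlocale("LC_TIME")

# The "C" locale is available on every platform and uses English names
Sys.setlocale("LC_TIME", "C")

formatted <- format(birth, "%d-%b-%Y")  # day, abbreviated month name, year
weekday <- format(birth, "%A")          # full weekday name

# Put the original locale back
Sys.setlocale("LC_TIME", old_locale)

print(formatted)  # "30-Aug-1930"
print(weekday)    # "Saturday"
```

Under a German locale the same format string would print the month as in the outputs above, for example "Mrz" instead of "Mar".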
[1] "2010-01-01"
[1] "2009-02-10 00:10:03 UTC"
uros.godnov@gmail.com